Abstract

Much of the past work in network analysis has focused on 
analyzing discrete graphs, where binary edges represent the 
"presence" or "absence" of a relationship. Since traditional 
network measures (e.g., betweenness centrality) utilize a dis- 
crete link structure, complex systems must be transformed 
to this representation in order to investigate network prop- 
erties. However, in many domains there may be uncertainty 
about the relationship structure and any uncertainty informa- 
tion would be lost in translation to a discrete representation. 
Uncertainty may arise in domains where there is moderating
link information that cannot be easily observed, i.e., links
become inactive over time but may not be dropped, or observed
links may not always correspond to a valid relationship. In
order to represent and reason with these types of uncertainty, 
we move beyond the discrete graph framework and develop 
social network measures based on a probabilistic graph rep- 
resentation. More specifically, we develop measures of path 
length, betweenness centrality, and clustering coefficient — 
one set based on sampling and one based on probabilistic 
paths. We evaluate our methods on three real-world networks
from Enron, Facebook, and DBLP, showing that our proposed 
methods more accurately capture salient effects without being 
susceptible to local noise, and that the resulting analysis pro- 
duces a better understanding of the graph structure and the 
uncertainty resulting from its change over time. 



Introduction 

Much of the past work in network analysis has focused 
on analyzing discrete graphs, where entities are repre- 
sented as nodes and binary edges represent the "pres- 
ence" or "absence" of a relationship between entities. 
Complex systems of relationships are first transformed to 
a discrete graph representation (e.g., a friendship graph) 
and then the connectivity properties of these graphs 
are used to investigate and understand the characteris- 
tics of the system. For example, network measures such 
as the average shortest path length and clustering
coefficient have been used to explore the properties of
biological and information networks (Watts and Strogatz 1998;
Leskovec, Kleinberg, and Faloutsos 2005), while measures
such as centrality have been used for determining the
most important and/or influential people in social networks
(Freeman 1977; Bonacich 1987).

The main limitation of measures defined for a discrete 
representation is that they cannot easily be applied to rep- 
resent and reason about uncertainty in the link structure. 
Link uncertainty may arise in domains where graphs evolve 
over time, as links observed at an earlier time may no longer
be present or active at the time of analysis. For example,
in online social networks, users articulate "friendships"
with other users and these links often persist over time, re- 
gardless of whether the friendship is maintained. This can 
result in uncertainty about whether an observed friendship 
link is still active at some later point in time. In addition, 
there may be uncertainty with respect to the strength of the
articulated relationships (Xiang, Neville, and Rogati 2010),
which can result in uncertainty about whether an observed
relationship will be used to transmit information and/or
influence. Furthermore, there are other network domains (e.g.,
gene/protein networks) where relationships can only be in- 
directly observed so there is uncertainty about whether an 
observed edge (e.g., protein interaction) actually indicates 
the presence of a valid relationship. 

In this work, we formulate a probabilistic graph represen- 
tation to analyze domains with these types of uncertainty 
and develop analogues for three standard discrete graph 
measures — average shortest path length, betweenness cen- 
trality, and clustering coefficient — in the probabilistic set- 
ting. Specifically, we use probabilities on graph edges to 
represent link uncertainty and consider the distribution of 
possible (discrete) graphs that they define, then we develop 
measures that consider the properties of the graph popula- 
tion defined by this distribution. 

Our first set of measures computes expected values over
the distribution of graphs, sampling a set of discrete graphs 
from this distribution in order to efficiently approximate the 
path length, centrality, and clustering measures. We then de- 
velop a second set of measures that can be directly computed 
from the probabilities, which removes the need for graph 
sampling. The second approach also affords us the oppor- 
tunity to consider more than just shortest paths in the net- 
work. We note that previous focus on shortest paths is due 
in part to an implicit belief that short paths are more likely 
to result in successful transfer of information and/or
influence between two nodes. This has led other works to
generalize shortest paths to the probabilistic domain for their own
purposes (Potamias et al. 2009). However, in a probabilistic 
framework we can also directly compute the likelihood of a 
path and consider the most probable paths, which are likely 
to facilitate information flow in the network. 

With probabilistic paths, we also introduce a prior to in- 
corporate the belief that the probability of successful in- 
formation transfer is a function of path length — since the 
existence of a relationship does not necessarily mean that 
information/influence will be passed across the edge. This 
formulation, which models the likelihood of information 
spread throughout the graph, is consistent with the finding in 
(Onnela et al. 2007), which identified that constricting and
relaxing the flow along the edges in the network was neces- 
sary to model the true patterns of information diffusion in an 
evolving communication graph. 

We evaluate our measures on three real-world networks:
Enron email, Facebook micro communications, and DBLP 
coauthorships. In these datasets, the network transactions 
are each associated with timestamps (e.g., email date). Thus 
we are able to compute the local (node-level) and aggregate 
(graph-level) measures at multiple time steps, where at each 
time step t we consider the network information available 
up to and including t. We compare against two different ap- 
proaches that use the discrete representation: an aggregate 
approach, which unions all previous transactions (up to t) 
into a discrete graph, and a slice approach, where only
transactions from a small window (i.e., $[t - \delta, t]$) are included
in the discrete representation. For our methods, we estimate 
edge probabilities from the transactions observed up to t, 
weighting each transaction with an exponential decay func- 
tion. Our analysis shows that our proposed methods more ac- 
curately capture the salient changes in graph structure com- 
pared to the discrete methods without being susceptible to 
local, temporal noise. Thus the resulting analysis produces 
a better understanding of the graph structure and its change 
over time. 

Related Work 

The notion of probabilistic graphs has been studied
previously, notably by (Frank 1969), (Hua and Pei 2010), and
(Potamias et al. 2009). (Frank 1969) showed how, for graphs
with probability distributions over the weights for each
edge, Monte Carlo sampling can be used to determine the
shortest path probabilities between the edges.
(Hua and Pei 2010) then extend this to find the shortest
weighted paths most likely to complete within a certain
time constraint (e.g., the shortest distance across town in
under half an hour). In (Potamias et al. 2009), the most
probable shortest paths are used to estimate the k-nearest
neighbors in the graph for a particular node. Although
(Potamias et al. 2009) draw sample graphs based on
likelihood (i.e., sampling each edge according to its probability),
in their estimate of the shortest path distribution they weight
each sample graph based on its probability, which is
incorrect unless the samples are drawn uniformly at random from
the distribution. In this work, we sample in the same manner
as (Potamias et al. 2009), but weight each sample uniformly
in our expectation calculations, since, when the graphs are
drawn from the distribution based on their likelihood, the
graphs with higher likelihood are more likely to be sampled.

There has also been some recent work that has devel- 
oped measures for time-evolving graphs, e.g., to identify 
the most central nodes throughout time (Tang et al. 2010)
and identify the edges that maximize communication
over time (Kossinets, Kleinberg, and Watts 2008). However,
these works fail to account for the uncertainty in both the
link structure and the communication across links (as
users are unlikely to propagate all information across a
single edge). Our use of a probabilistic graph framework and
transmission prior addresses these two cases of uncertainty.

Sampling Probabilistic Graphs 

Let $G = (V, E)$ be a graph where $V$ is a collection
of nodes and $E \subseteq V \times V$ is the set of edges, or
relationships, between the nodes. In order to represent and
reason about relationship uncertainty, we associate each
edge $e_{ij}$ (which connects nodes $v_i$ and $v_j$) with a
probability $P(e_{ij})$. Then we can define $Q$ to be a
distribution of discrete, unweighted graphs. Assuming
independence among edges, the probability of a graph $G \in Q$
is:

$$P(G) = \prod_{e_{ij} \in E} P(e_{ij}) \prod_{e_{ij} \notin E} [1 - P(e_{ij})]$$

Since we have assumed edge independence, we can sample a graph
$G_s$ from $Q$ by sampling edges independently according to
their probabilities $P(e_{ij})$. Based on this, we can develop
methods to compute the expected shortest path lengths,
betweenness centrality rankings, and clustering coefficients
using sampling.
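
As an illustration of the sampling step, the following minimal sketch draws one discrete graph from a table of edge probabilities and evaluates the probability of a given graph under edge independence. The function names, the dictionary layout, and the three-edge example values are assumptions made for this illustration only.

```python
import random

def sample_graph(edge_probs, rng):
    """Draw one discrete graph: keep each edge independently with probability P(e_ij)."""
    return {edge for edge, p in edge_probs.items() if rng.random() < p}

def graph_probability(edge_probs, present):
    """P(G) under edge independence: P(e) for present edges, 1 - P(e) for absent ones."""
    prob = 1.0
    for edge, p in edge_probs.items():
        prob *= p if edge in present else (1.0 - p)
    return prob

# Hypothetical three-edge example; the probabilities are illustrative only.
edge_probs = {("a", "b"): 0.9, ("b", "c"): 0.5, ("a", "c"): 0.1}
rng = random.Random(0)
sampled = sample_graph(edge_probs, rng)
p_two_edges = graph_probability(edge_probs, {("a", "b"), ("b", "c")})
```

Because edges are kept independently, graphs with higher P(G) are drawn more often, which is why sampled graphs can be averaged with equal weight.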

Probabilistic Average Shortest Path Length Let
$p_{ij} = \{v_{k_1}, v_{k_2}, \ldots, v_{k_q}\}$ refer to a path of
$q$ vertices connecting two vertices $v_i$ and $v_j$, i.e.,
$v_{k_1} = v_i$ and $v_{k_q} = v_j$, and from each vertex to the
next there exists an edge: $e_{k_l k_{l+1}} \in E$ for
$l \in [1, q - 1]$. Let $V(p_{ij})$ and $E(p_{ij})$ refer to the
set of vertices and edges, respectively, in the path and let
$|p_{ij}| = |E(p_{ij})|$ refer to the length of the path.
Assuming connected graphs, for every unweighted graph
$G = (V, E) \in Q$ there exists a shortest path $p_{ij}^{min}$
between every pair of nodes $v_i, v_j \in V$. Letting
$SP_{ij} = |p_{ij}^{min}|$, we can then define the average
shortest path length in $G$ as:

$$SP(G) = \frac{1}{|V| \cdot (|V| - 1)} \sum_{i \in V} \sum_{j \in V \setminus i} SP_{ij}$$

Now, when there is uncertainty about the edges in $G$, we
can compute the expected average shortest path length by
considering the distribution of graphs $Q$. For any reasonably
sized graph, the distribution $Q$ will be intractable to
enumerate explicitly, so instead we sample from $Q$ to approximate
the expected value. More specifically, we sample a graph
$G_s$ by sampling each edge independently according to
its edge probability $P(e_{ij})$. Graphs sampled in this manner
are drawn in proportion to their likelihood, so each sample
receives equal weight. Thus we can draw $m$ sample graphs
$G_1, \ldots, G_m$ and calculate the expected shortest path
length with the following:

$$\mathbb{E}_{Q}[SP] = \sum_{G \in Q} SP(G) \cdot P(G) \approx \frac{1}{m} \sum_{s=1}^{m} SP(G_s) \quad (1)$$

Since the sampled graphs are unweighted, it takes
$O(|V||E|)$ time to compute $SP$ for each sample
(Brandes 2001). This results in an overall cost of
$O(m \cdot |V||E|)$ to compute $\mathbb{E}_{Q}[SP]$.
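
The sampling estimate of the expected average shortest path length can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, and because the text assumes connected graphs, the handling of unreachable pairs here (they are skipped) is an assumption of this example.

```python
import random
from collections import deque

def bfs_distances(adj, source):
    """Hop counts from source to every reachable node (unweighted BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def average_shortest_path(nodes, edges):
    """SP(G) over ordered pairs. Unreachable pairs are skipped (an assumption);
    returns None when no pair of nodes is connected."""
    adj = {v: set() for v in nodes}
    for u, w in edges:
        adj[u].add(w)
        adj[w].add(u)
    total, pairs = 0, 0
    for v in nodes:
        dist = bfs_distances(adj, v)
        total += sum(dist.values())   # the source itself contributes 0
        pairs += len(dist) - 1
    return total / pairs if pairs else None

def expected_average_shortest_path(nodes, edge_probs, m, seed=0):
    """Equal-weight average of SP over m graphs sampled edge by edge."""
    rng = random.Random(seed)
    values = []
    for _ in range(m):
        edges = [e for e, p in edge_probs.items() if rng.random() < p]
        sp = average_shortest_path(nodes, edges)
        if sp is not None:
            values.append(sp)
    return sum(values) / len(values) if values else None
```

Each sample costs one BFS per node, matching the per-sample cost discussed in the text.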

Sampled Centrality Betweenness centrality for a node $v_i$
is defined to be the number of shortest paths between other
pairs of nodes which pass through $v_i$:
$BC_i = |\{p_{jk}^{min} \in G : v_i \in V(p_{jk}^{min}) \wedge i \neq j, k\}|$.
Vertices that contribute
to the existence of many shortest paths will have a higher
BC score than other nodes that contribute to fewer shortest
paths, thus BC is used as a measure of importance or centrality
in the network. It is difficult to directly compare BC values
across graphs since the number of shortest paths varies with
graph size and connectivity. Thus, analysis typically focuses
on betweenness centrality rankings (BCR), where the nodes
are ranked in descending order of their BC scores and the
node with the highest BC score is given a BCR of 1.

As discussed above, we can compute the shortest paths
for each unweighted graph $G \in Q$, so we can also compute
the BCR values for each such graph. We
denote $BCR_i(G)$ as the betweenness centrality ranking for
node $v_i$ in $G$. Then we can approximate the expected BCR
for each node by sampling a set of $m$ graphs from $Q$:

$$\mathbb{E}_{Q}[BCR_i] \approx \frac{1}{m} \sum_{s=1}^{m} BCR_i(G_s) \quad (2)$$

Again, since the sampled graphs are unweighted, it
takes $O(|V||E|)$ time to compute the BCR for each
sample (Brandes 2001), resulting in an overall cost of
$O(m \cdot |V||E|)$.
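
A compact sketch of the sampled BCR estimate is given below. It uses the standard BFS-based form of Brandes' algorithm for unweighted graphs; the function names and the tie-breaking rule for rankings (ties broken by node label) are assumptions of this example, since the text does not specify how ties are ranked.

```python
import random
from collections import deque

def betweenness(nodes, adj):
    """Brandes' algorithm for unweighted graphs: BFS, then dependency accumulation.
    Each undirected pair is counted in both directions, which does not affect ranks."""
    bc = {v: 0.0 for v in nodes}
    for s in nodes:
        order = []
        preds = {v: [] for v in nodes}
        sigma = {v: 0 for v in nodes}
        dist = {v: -1 for v in nodes}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc

def rankings(scores):
    """Rank 1 = highest score; ties broken by node label (an assumption)."""
    ordered = sorted(scores, key=lambda v: (-scores[v], str(v)))
    return {v: rank for rank, v in enumerate(ordered, start=1)}

def expected_bcr(nodes, edge_probs, m, seed=0):
    """Average each node's rank over m sampled graphs."""
    rng = random.Random(seed)
    totals = {v: 0.0 for v in nodes}
    for _ in range(m):
        adj = {v: set() for v in nodes}
        for (u, w), p in edge_probs.items():
            if rng.random() < p:
                adj[u].add(w)
                adj[w].add(u)
        for v, rank in rankings(betweenness(nodes, adj)).items():
            totals[v] += rank
    return {v: totals[v] / m for v in nodes}
```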

Sampled Clustering Coefficients Clustering coefficient
is a measure of how the nodes in a graph cluster
together (Watts and Strogatz 1998). For a node $v_i$
with $N_i = \{v_{j_1}, \ldots, v_{j_n}\}$ neighbors (i.e.,
$e_{i j_m} \in E$), its clustering coefficient is defined as

$$CC_i = \frac{1}{|N_i|(|N_i| - 1)} \sum_{v_j \in N_i} \sum_{v_k \in N_i, k \neq j} \mathbb{1}_E(e_{jk})$$

where $\mathbb{1}_E$ is an
indicator function which returns 1 if $v_j$ is connected to $v_k$.
CC can be thought of as the fraction of connected pairs of
neighbors of $v_i$. We denote $CC_i(G)$ as the clustering
coefficient for node $v_i$ in graph $G$. Similar to paths, we can
compute clustering coefficients for every graph $G \in Q$. Thus we
can approximate the expected CC for each node by sampling
a set of $m$ graphs from $Q$:

$$\mathbb{E}_{Q}[CC_i] \approx \frac{1}{m} \sum_{s=1}^{m} CC_i(G_s) \quad (3)$$

Under the assumption that the maximum degree in the graph
can be bounded by a fixed constant (which is typical for
sparse social networks), we can compute the clustering
coefficient for a single graph in $O(|V|)$ time (i.e., $O(1)$ for each
node), which results in an overall cost of $O(m \cdot |V|)$.
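
The sampled clustering coefficient can be sketched in the same style. The function names are hypothetical, and assigning a coefficient of 0 to nodes with fewer than two neighbors in a sampled graph is an assumption of this example, as the text does not address that case.

```python
import random
from itertools import combinations

def clustering_coefficient(adj, v):
    """Fraction of pairs of v's neighbors that are themselves connected."""
    neighbors = adj[v]
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumption: undefined case treated as 0
    linked = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return 2.0 * linked / (k * (k - 1))

def expected_clustering(nodes, edge_probs, m, seed=0):
    """Average CC_i over m graphs sampled edge by edge."""
    rng = random.Random(seed)
    totals = {v: 0.0 for v in nodes}
    for _ in range(m):
        adj = {v: set() for v in nodes}
        for (u, w), p in edge_probs.items():
            if rng.random() < p:
                adj[u].add(w)
                adj[w].add(u)
        for v in nodes:
            totals[v] += clustering_coefficient(adj, v)
    return {v: totals[v] / m for v in nodes}
```

Counting unordered neighbor pairs and multiplying by 2 gives the same value as the ordered double sum in the definition.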

Probabilistic Path Length 

In the previous section, we discussed how to extend the dis- 
crete notions of shortest paths and centrality into a prob- 
abilistic graph framework via expected values, and we 
showed how to estimate approximate values using sampling. 



While our sampling-based measures are valid and give in- 
formative results (see section 6 for details), they have two 
limitations which restrict their applicability. 

First, the effectiveness of the approximation depends 
on the number of samples from Q. We note that
(Potamias et al. 2009) used a Hoeffding inequality to show
that relatively few samples are needed to compute an accurate
estimate of independent shortest paths in probabilistic
graphs. However, since our calculation of BCR is based
on the joint occurrence of shortest paths in the graph, this 
bound will not hold for our measures. 

Second, since the expectation is over possible worlds (i.e., 
$G \in Q$), the focus on shortest paths may no longer be the
best way to capture node importance. We note that in the 
discrete framework, where all edges are equally likely, the 
use of shortest paths as a proxy for importance implies a 
prior belief that shorter paths are more likely to be used suc- 
cessfully to transfer information and/or influence in the net- 
work. In domains with link uncertainty, the flow of infor- 
mation/influence will depend on both the existence of paths 
in the network and the use of those paths for communica- 
tion/transmission. In a probabilistic framework, we have an 
opportunity to explicitly incorporate the latter, by encoding 
our prior beliefs about transmission likelihood into mea- 
sures of node importance. Furthermore, although a prob- 
abilistic representation enables analysis of more than just 
shortest paths, as we note above, even to capture shortest 
paths the sampling methods described previously may need 
many samples to accurately estimate the joint existence of 
shortest paths. Thus, a measure that explicitly uses the edge 
probabilities to calculate most probable paths may more ac- 
curately highlight nodes that serve to connect many parts of 
the network. We discuss each of these issues more below. 

Most Probable Paths To begin, we extend the notion
of discrete paths to probabilistic paths in our framework.
Specifically, we can calculate the probability of the
existence of a path $p_{ij}$ as follows (again assuming edge
independence): $P(p_{ij}) = \prod_{e_{uv} \in E(p_{ij})} P(e_{uv})$.
Using the path
probabilities, we can now describe the notion of the most
probable path. Given two nodes $v_i, v_j$, the most probable
path is simply the one with maximum likelihood:
$p_{ij}^{ML} = \arg\max_{p_{ij}} P(p_{ij})$. We can compute the most likely
paths in much the same way that shortest paths are computed
on weighted discrete graphs, by applying Dijkstra's shortest
path algorithm, but instead of expanding on the shortest
path, we expand the most probable path. Thus, all most
probable paths can be calculated in $O(|V||E| + |V|^2 \log |V|)$.
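
One way to realize this search is sketched below. It runs Dijkstra's algorithm on edge weights of -log P(e), so that minimizing the summed weight maximizes the product of edge probabilities. This log transform is an equivalent reformulation chosen for the sketch rather than the procedure stated in the text, and the function name and adjacency layout are assumptions.

```python
import heapq
import math

def most_probable_path(adj_probs, source, target):
    """Dijkstra on weights -log P(e): the cheapest path is the most probable one."""
    best = {source: 0.0}
    parent = {source: None}
    heap = [(0.0, source)]
    done = set()
    while heap:
        cost, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == target:
            break
        for w, p in adj_probs.get(u, {}).items():
            if p <= 0.0:
                continue  # an edge with probability 0 can never be used
            new_cost = cost - math.log(p)
            if new_cost < best.get(w, math.inf):
                best[w] = new_cost
                parent[w] = u
                heapq.heappush(heap, (new_cost, w))
    if target not in best:
        return None, 0.0
    path, node = [], target
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1], math.exp(-best[target])

# Hypothetical example: the two-hop route (0.9 * 0.9) beats the direct edge (0.5).
adj_probs = {
    "a": {"b": 0.9, "c": 0.5},
    "b": {"a": 0.9, "c": 0.9},
    "c": {"a": 0.5, "b": 0.9},
}
path, prob = most_probable_path(adj_probs, "a", "c")
```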

Transmission Prior Previous focus on shortest paths for
assessing centrality points to an implicit assumption that
if an edge connects two nodes, it can be successfully
used for transmission of information and/or influence in the
network. Although there has been work both in maximizing
the spread of information in a network through the use
of central nodes (Borgatti 2005; Newman 2005) and in the
study of information propagation through the use of
transmission probabilities (Goldenberg, Libai, and Muller 2001),
there has been little prior work that has incorporated
transmission probabilities into node centrality measures.
Centrality measures based on random walks and
eigenvectors (Newman 2005) implicitly penalize longer paths as they
consider all paths between nodes in the network. However,
in our framework we can incorporate transmission
probabilities to penalize the probabilities of longer paths in the graph,
in order to more accurately capture the role nodes play in the
spread of information across multiple paths in the network.

Consider the case where there is one path of nine peo- 
ple where each edge has high probability of existence (e.g., 
0.95) and another path of three people where the edge prob- 
abilities are all moderate (e.g., 0.70), both ending at node 
$v$. Here, the longer path is more likely to exist than the
shorter path, but in this example we are more interested in 
which path is used to transfer a virus to v. Even when an 
edge exists (i.e., the relationship is active), the virus will not 
be passed with certainty to the next node, thus the trans- 
mission probability is independent of the edge probability. 
Moreover, when the transmission probability is less than 1, 
it is more likely that the virus will be transmitted across the 
shorter path, since the longer path presents more opportuni- 
ties for the virus to be dropped. This provides additional in- 
sight as to why shortest paths have always been considered 
important — there is generally a higher likelihood of trans- 
mission if it is passed through fewer nodes in the network. 
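
The arithmetic behind this example can be checked with a few lines. The sketch assumes that a path through n people has n - 1 edges, and the per-hop transmission probability of 0.5 is a hypothetical value chosen only for illustration.

```python
def existence_probability(num_people, edge_prob):
    """Probability that every edge on the path exists (independent edges)."""
    return edge_prob ** (num_people - 1)

def delivery_probability(num_people, edge_prob, transmit):
    """Probability that the path exists AND every hop passes the item on."""
    hops = num_people - 1
    return existence_probability(num_people, edge_prob) * transmit ** hops

long_exists = existence_probability(9, 0.95)    # 0.95 ** 8, about 0.66
short_exists = existence_probability(3, 0.70)   # 0.70 ** 2 = 0.49

# Hypothetical per-hop transmission probability of 0.5.
long_delivers = delivery_probability(9, 0.95, 0.5)
short_delivers = delivery_probability(3, 0.70, 0.5)
```

Under these assumptions the longer path is more likely to exist, while the shorter path is far more likely to deliver, because the longer path compounds the transmission probability over eight hops instead of two.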

To incorporate transmission likelihood into our
probabilistic paths, we assign a probability $\beta$ of success for every
step in a particular path, corresponding to the probability
that information is transmitted across an edge and is received
by the neighboring node. If we denote $l$ to be the length of
a path $p$, and $s$ to be the number of successful transmissions
along the path, we can use a binomial distribution to
represent the transmission probability across $p$ with:

$$SBin(l \mid \beta) = Bin(s = l \mid l, \beta) = \beta^l$$

Here $SBin$ corresponds to the case where the transmission
always succeeds (i.e., across all edges in $p$). Using this
binomial distribution as a prior allows us to represent the
expected probability of information spread in an intuitive
manner, giving us a parameter $\beta$ which we can adjust to fit
our expectations for the information spread in the graph.
Note that setting $\beta = 1$ is equivalent to the most
probable paths discussed earlier. The prior effectively handicaps
longer paths through the graph. Although there is a
correlation between shortest (certain) paths and handicapped
(uncertain) paths, these formulations are not equivalent, since
the latter produces a different set of paths when the shortest
paths have low probability of existence.

ML Handicapped Paths Now that we have both the
notion of a probabilistic path, and an appropriate prior for
modeling the probability of information spreading along the
edges in the path, we can formulate the maximum likelihood
handicapped path between two nodes $v_i$ and $v_j$ to be:

$$p_{ij}^{MLH} = \arg\max_{p_{ij}} [P(p_{ij}) \cdot SBin(|p_{ij}| \mid \beta)] \quad (4)$$

To compute the most likely handicapped (MLH) paths, we
follow the same formulation as the most probable paths,
keeping track of the path length and posterior at each point.
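
A minimal sketch of an MLH path search is shown below. It expands the node with the highest posterior first and carries the path length along with each entry; because every extra hop multiplies the posterior by an edge probability times the per-hop success probability, both at most 1, the greedy expansion remains valid. The function name, the heap layout, and the example graph are assumptions of this illustration, and the per-hop probability `beta` is assumed to lie in (0, 1].

```python
import heapq

def mlh_path(adj_probs, source, target, beta):
    """Best-first search maximizing P(path) * beta ** len(path)."""
    # Heap entries: (-posterior, path_length, node, path_so_far)
    heap = [(-1.0, 0, source, (source,))]
    settled = set()
    while heap:
        neg_post, length, u, path = heapq.heappop(heap)
        if u in settled:
            continue
        settled.add(u)
        if u == target:
            return list(path), -neg_post, length
        for w, p in adj_probs.get(u, {}).items():
            if w not in settled and p > 0.0:
                # One more hop: multiply by the edge probability and by beta.
                heapq.heappush(heap, (neg_post * p * beta, length + 1, w, path + (w,)))
    return None, 0.0, 0

# Hypothetical example graph (same layout as an undirected adjacency map).
adj_probs = {
    "a": {"b": 0.9, "c": 0.5},
    "b": {"a": 0.9, "c": 0.9},
    "c": {"a": 0.5, "b": 0.9},
}
no_prior = mlh_path(adj_probs, "a", "c", beta=1.0)     # two-hop route, 0.81
handicapped = mlh_path(adj_probs, "a", "c", beta=0.3)  # direct edge, 0.5 * 0.3
```

With the handicap at 1 the search reduces to the most probable path; lowering it shifts the choice toward the shorter, less certain route.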



In the MLH formulation, probable paths are weighted by
likelihood of transmission, thus nodes that lie on paths that
are highly likely and relatively short will have a high BC
ranking. To calculate BCR based on MLH paths,
we can use a weighted betweenness centrality algorithm.
Specifically, we modify Brandes' algorithm (Brandes 2001)
to start with the path that has the lowest probability of
occurrence to be the one to backtrack from, enabling computation
of the betweenness centrality in $O(|V||E| + |V|^2 \log |V|)$.

Comparison with Discrete Graphs 

The formulation of MLH Paths has inherent benefits, most 
notably with its direct connection to the previously well- 
studied notions of shortest paths and betweenness central- 
ity in discrete graphs. In fact, we can view a discrete graph 
$G$ as being a special case of a probabilistic graph with edge
probabilities:

$$P(e_{ij}) = \begin{cases} 1 & \text{if the edge exists} \\ 0 & \text{if the edge does not exist} \end{cases} \quad (5)$$

We denote the distribution of graphs defined by these
probabilities as $Q_1$. Note that the only graph in $Q_1$ with non-zero
probability is $G$: if an edge exists in a discrete graph,
then it exists with complete certainty; likewise, if an edge is
not present, we are certain it does not exist. Thus $P(G) = 1$.

Theorem 1. For every pair of nodes $v_i$ and $v_j$, the shortest
path in the discrete graph ($p_{ij}^{min} \in G$) is equal to the most
probable path discovered by the MLH algorithm
($p_{ij}^{MLH} \in Q_1$), for $0 < \beta < 1$.

Proof. In $Q_1$ every $P(e_{ij})$ is either 1 or 0, thus every case
where $P(p_{ij}) > 0$ is precisely $P(p_{ij}) = 1$. If we choose
the shortest path from the discrete graph, it will have length
$l^* = |p_{ij}^{min}|$, and the MLH probability for the same path will
be $\beta^{l^*}$. Clearly, if a longer path were chosen by MLH, its
probability would be less than $\beta^{l^*}$, and we know that no
shorter paths exist, since all paths shorter than $p_{ij}^{min}$ would
involve an edge that did not exist in $G$ and thus would have
probability 0. □

Corollary 1. The betweenness centrality using shortest
paths on a discrete graph $G$ can be equivalently calculated
with most probable handicapped paths over $Q_1$, where edge
probabilities are defined by Equation (5).

Proof. This follows directly from Theorem 1. □



Probabilistic Clustering Coefficient 

We now outline a probabilistic measure of clustering
coefficient that can be computed without the need for
sampling. If we assume independence between edges,
the probability of a triangle's existence is equal to the
product of the probabilities of the three sides. The
expected number of triangles is then the sum of the
probabilities of the triangles that include a given node $v_i$.
Denoting by $Tr_i$ the triangles including $v_i$, we get:

$$\mathbb{E}_{Q}[Tr_i] = \sum_{v_j, v_k \in N_i, v_j \neq v_k} [P(e_{ij}) \cdot P(e_{ik}) \cdot P(e_{jk})]$$

Denoting by $Co_i$ the combinations (i.e., coexisting
pairs) of the neighbors of $v_i$, we then get:

$$\mathbb{E}_{Q}[Co_i] = \sum_{v_j, v_k \in N_i, v_j \neq v_k} [P(e_{ij}) \cdot P(e_{ik})]$$

We can then define the
probabilistic clustering coefficient to be the expectation of
the ratio $Tr_i / Co_i$, and approximate it via a first order Taylor
expansion (Elandt-Johnson and Johnson 1980):

$$CC_i = \mathbb{E}_{Q}\left[\frac{Tr_i}{Co_i}\right] \approx \frac{\mathbb{E}_{Q}[Tr_i]}{\mathbb{E}_{Q}[Co_i]} \quad (6)$$

Assuming again that the maximum degree in the graph
can be bounded by a fixed constant, we can compute the
probabilistic clustering coefficient in $O(|V|)$ time ($O(1)$ for
each node). Additionally, the probabilistic approximation to
the clustering coefficient shares connections with the
traditional clustering coefficients on discrete graphs.
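
The ratio of expected triangles to expected neighbor pairs can be computed directly from the edge probabilities, as in the following sketch. The function name and adjacency layout are hypothetical, and returning 0 when a node has no coexisting neighbor pairs is an assumption of this example.

```python
from itertools import combinations

def probabilistic_clustering(adj_probs, i):
    """Approximate CC_i as E[triangles at i] / E[coexisting neighbor pairs at i]."""
    neighbors = [j for j, p in adj_probs.get(i, {}).items() if p > 0.0]
    expected_triangles = 0.0
    expected_pairs = 0.0
    # Unordered pairs are enough: ordered pairs would double both sums equally.
    for j, k in combinations(neighbors, 2):
        pair = adj_probs[i][j] * adj_probs[i][k]
        expected_pairs += pair
        expected_triangles += pair * adj_probs.get(j, {}).get(k, 0.0)
    if expected_pairs == 0.0:
        return 0.0  # assumption: undefined case treated as 0
    return expected_triangles / expected_pairs

# Hypothetical example: one uncertain triangle around node "i".
uncertain = {
    "i": {"j": 0.5, "k": 0.5},
    "j": {"i": 0.5, "k": 0.4},
    "k": {"i": 0.5, "j": 0.4},
}
cc_uncertain = probabilistic_clustering(uncertain, "i")
```

With a single neighbor pair the ratio reduces to the probability of the closing edge, here 0.4; with 0/1 probabilities it reduces to the discrete clustering coefficient.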

Theorem 2. The probabilistic clustering coefficients
computed in $Q_1$, with probabilities defined by Equation (5) for a
discrete graph $G$, are equal to the discrete clustering
coefficients calculated on $G$.

Proof. Any triangle from $G$ has probability 1 in $Q_1$, while
any non-triangle in $G$ clearly has probability 0. The same is
true for the combinations of pairs of neighbors. As such, the
sums of the numerators and denominators will be equal for
both clustering coefficients. □

Experiments 

To investigate the performance of our proposed MLH and 
sampling methods for average path length, betweenness 
centrality and clustering coefficient, we compare to tradi- 
tional baseline social network measures on data from Enron, 
DBLP, and Facebook. These datasets all consist of time- 
stamped transactions among people (e.g., email, joint au- 
thorship). We will use the temporal activity information to 
derive probabilities for use in our methods, and evaluate our 
measures at multiple time steps to show the evolution of 
measures in the three datasets. 

Datasets 

For our analysis we first use the Enron dataset
(Shetty and Adibi 2004). The advantage of this dataset
is that it allows us to understand the effects of our
probabilistic measures because key events and central people
have been well documented (Marks). We consider the
subset of the data comprising the emails sent between
employees, resulting in a dataset with 50,572 emails among
151 employees.

Our second dataset is a sample from the DBLP computer 
science citation database. We considered the set of authors 
who had published more than 75 papers in the timeframe 
1967-2006, and the coauthor relationships between them. 
The resulting subset of data consisted of 1,384 nodes, with 
23,748 co-author relationships.

Our third dataset is from the Purdue University Facebook 
network. Specifically we consider one year's worth of wall- 
to-wall postings between users in the class of 2011 subnet- 
work. The sample has 2,648 nodes with 59,565 messages. 



Methodology 

We compare four network measures for each timestep t in 
each dataset. When evaluating at time t, each method is able 
to utilize the graph edges that have occurred up to and in- 
cluding t. As baselines, we compare to (1) an aggregate 
method, which at a particular time t computes standard mea- 
sures for discrete graphs (e.g., BCR) on the union of edges 
that have occurred up to and including t, and (2) a time slice 
method, which again computes the standard measures, but 
only considers the set of edges that occur within the time 
window $[t - \delta, t]$. For Enron and Facebook, we used
$\delta = 14$ days and for DBLP, we considered $\delta = 1$ year.

We then compare to the sampling and MLH measures. 
For both the probabilistic methods, we need a measure of 
relationship strength to use as probabilities in our model. 
Although any notion of relationship strength can be substi- 
tuted at this step, in this work we utilize a measure of re- 
lationship strength based on decayed message counts. More 
specifically, we define two separate and distinct notions of 
connection between nodes: edges and messages. We define 
an edge to be the unobservable probabilistic connection 
between two nodes, indicating whether the nodes have an 
active relationship. This is in contrast to messages: a
message $m_{ij}$ is a concrete and directly measurable
communication between two nodes $v_i$ and $v_j$, such as a wall posting or
email, occurring at a specific time, which we denote $t(m_{ij})$.
We define the probability of nodes $v_i$ and $v_j$ having an
active relationship at the current timestep $t_{now}$, based on
observing a message at time $t(m_{ij})$, to be the exponential
decay of a particular message:

$$P(e_{ij}^{t} \mid m_{ij}) = Exp(m_{ij} \mid t_{now}, \lambda) = \exp\left(-\frac{1}{\lambda}\left(t_{now} - t(m_{ij})\right)\right)$$

Note that the scaling parameter $\lambda$ refers to the adjustment
of the basic time unit (e.g., 7 days to 1 week), not the rate
parameter which defines the exponential probability density
function, which in this case is 1. This allows for assigning
a probability of 1 to the case when $t(m_{ij}) = t_{now}$, but it
also assigns reasonable probabilities (i.e., slows the decay)
for messages that happened in the recent past, which could
still indicate active relationships.

Now, we assume we have $k$ messages between $v_i$ and
$v_j$, and any of the messages $m_{ij}^1, \ldots, m_{ij}^k$ can contribute to
the relationship strength, which is defined to be 1 minus the
probability that none of them contribute:

$$P(e_{ij}^{t} \mid m_{ij}^1, \ldots, m_{ij}^k) = 1 - \prod_{n=1}^{k} \left[1 - Exp(m_{ij}^n \mid t_{now}, \lambda)\right]$$
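
The edge probability computation can be sketched as below. The sketch assumes the per-message decay takes the form exp(-(t_now - t(m)) / lambda), which is one reading consistent with the description in the text (probability 1 for a message sent at the current timestep, with the scaling parameter rescaling the time unit); the function names and the use of plain numbers for timestamps are likewise assumptions.

```python
import math

def message_probability(t_message, t_now, lam):
    """Decayed evidence from one message; equals 1 when the message is sent at t_now.
    Assumed form: exp(-(t_now - t_message) / lam)."""
    return math.exp(-(t_now - t_message) / lam)

def edge_probability(message_times, t_now, lam):
    """1 minus the probability that none of the messages indicates an active edge."""
    none_active = 1.0
    for t in message_times:
        none_active *= 1.0 - message_probability(t, t_now, lam)
    return 1.0 - none_active

# Hypothetical example with times in days and a scaling parameter of 28 days.
recent = edge_probability([100.0], t_now=100.0, lam=28.0)      # message sent today
older = edge_probability([72.0, 72.0], t_now=100.0, lam=28.0)  # two messages, 28 days old
```

Several old messages can jointly yield a substantial edge probability even though each one alone has decayed.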



In order to choose a scaling parameter $\lambda$ for the
exponential decay, we measured the average correlation of the
sampling method BCR against the time slice and aggregate
method rankings for each Enron employee, for different
values of $\lambda$ (see Figure 1a). Note that a $\lambda$ close to 0 corresponds
to 'forgetting' a transaction quickly and is highly correlated
with the slice method, while a large $\lambda$ corresponds to
'remembering' a transaction for a long time, giving it high
correlation with the aggregate method. In order to balance
between short term change and long term trends we set $\lambda$ to a
'middle ground' with $\lambda = 28$ days. This applies to both the
Enron and Facebook datasets. For DBLP, where we evaluate
yearly, $\lambda$ is set to 2 years to keep the ratio between time slice
and $\lambda$ consistent between Facebook, Enron, and DBLP.

Figure 1: (a) Correlation between methods for varying values of $\lambda$ (x-axis: scaling parameter $\lambda$, in days). (b) Correlation of MLH with other methods as $\beta$ is varied (x-axis: handicap parameter $\beta$).
(c-e) Correlations of Enron employee BCRs across methods, for the time segment ending August 24th, 2001.

In order to choose a value for the $\beta$ parameter in the MLH
method, we measured the average correlation of the BCR
from the MLH method with the sampling,
aggregate, and slice rankings for different values of $\beta$. We
can see in Figure 1b that as long as $\beta$ is non-zero, it has
minimal effect on the correlations. For the experiments
reported in this paper, we set $\beta = 0.3$. Note that omission of
the prior (i.e., $\beta = 1$) will make the MLH paths similar to
the slice paths, with added paths between vertices which are
disjoint in a particular time slice.

The final parameter setting is the number of samples to
consider in each of the sampling-based measures. Earlier we
discussed how we are computing the joint instances of
shortest paths, and that the bound by (Potamias et al. 2009) does
not hold. Because of this, we exploit the small size of the
Enron dataset and take 10,000 samples; however, with the two
larger graphs we use a smaller sample size of 200 in order to
make the experiments tractable.

Method Correlations on Enron Data 

In order to illustrate the differences between the four meth- 
ods, we analyze their respective BCR on the Enron data for 
the time window ending August 14th, 2001. Figure 1c-e
shows the correlations of employee BCR across a pair of 
methods: points on the diagonal green line indicate 'perfect' 
correlation between the rankings of two methods. 

Figure 1c shows that the MLH method closely matches
the sampling method, with only a few nodes varying from
the diagonal. However, a large number of nodes that the
sampling method determines to have high centrality are missed
by the slice method, due to the slice's inability to see
transactions that occurred prior to the evaluation time window.
Additionally, we note that August 14th, 2001 is relatively late
in the Enron timeline, which results in the aggregate method
having little correlation with the sampling method, since the
more recent changes are washed out by past transactions in
the aggregate approach.



Local Trend Analysis 

Lay and Skilling Here, we analyze two key figures at
Enron: Kenneth Lay and Jeffrey Skilling. These two were
central to the Enron scandal, as first Lay, then Skilling, and
then Lay again assumed the position of CEO. We can
analyze the BCR for Lay and Skilling during these transition
periods, as we expect large changes to affect both of them.




Figure 2: BCR of Lay and Skilling over time. Red lines
indicate Skilling's CEO announcement and resignation.

The first event we consider (marked by a vertical red line
in Figure 2) is December 13th, 2000, when it was announced
that Skilling would assume the CEO position at Enron, with
Lay retiring but remaining as chairman (Marks). In
Figure 2a, both the sampling method and the MLH method
identify a spike in BCR for both Lay and Skilling directly
before the announcement. This is not surprising, as
presumably Skilling and Lay were informing the other executives
about the transition that was about to be announced.

The time slice method (Figure 2c) produces no change in Lay's
BCR, despite his central role in the transition. Skilling shows
a few random spikes of BCR, which illustrates the variance
associated with using the time slices. The aggregate model
(Figure 2d) fails to reduce Skilling's BCR to the expected levels
following the announcement. Although this is fairly early in
the timeline, we already see the aggregate method's inability to
track current events, because it takes the union of all past
transactions. Both the sampling and MLH methods capture the
drop in Skilling's BCR: MLH has him return to an extremely low
centrality, while sampling gives him a fairly low centrality
with some variance.

The second event we consider (marked by the second vertical
red line in Figure 2) is August 14th, 2001, when, seven
months after initially taking the CEO position, Skilling
approached Lay about resigning (Marks). During the entirety
of Skilling's tenure, we see that Lay has a slight effect on
the sample rankings but is not what would be considered
a 'central' node. Not surprisingly, Skilling has a fairly high
centrality during his time as CEO; both the sampling method
and MLH method capture this.

Figure 3: (a,c,e) BCR of Kitchen and Lavorato. (b,d,f) BCR
for 3 nodes in the Purdue Facebook network.

Prior to the announcement of Lay's takeover as CEO, the
slice method still had no weight on him, despite his previous
involvement with the first transition. Also, we note that the
sampling, MLH, and slice methods all agree that after Lay's
initial spike from the Skilling resignation, he resumes
having a lower centrality, which the aggregate method misses.
In general, the sampling method seems to mirror the slice
method, albeit with less variance, but it is not as smooth as
the MLH method, indicating the utility of considering most
probable paths.

Kitchen and Lavorato Next we analyze Louise Kitchen and
John Lavorato, who were executives (Shetty and Adibi 2004)
for Enron Americas, which was the wholesale trading section
of Enron (Raghavan, Kranhold, and Barrionuevo). They are notable
because of the extraordinarily high bonuses they received as
Enron was being investigated, and were also found to have
a high temporal betweenness centrality using the method
defined by (Tang et al. 2010). We can see in Figure 3(a,c,e)
the rankings of Kitchen and Lavorato, and can see the
benefit of the probabilistic framework's ability to key in on
centralities at specific times, rather than using the temporal
definition through time proposed by (Tang et al. 2010). We
see that while Lavorato might have gotten a large bonus,
he is only important during Skilling's tenure as CEO; his
centrality drops noticeably otherwise. On the other hand,
Kitchen had extremely high rankings throughout.

Here, we see that the slice method exhibits high
variability, especially with Kitchen, while the aggregate cannot
recognize Lavorato's lack of importance after Skilling's
departure. The MLH method is able to smoothly capture Kitchen's
centrality, while keeping Lavorato important solely during
Skilling's CEO tenure.

Facebook Centrality Unlike the Enron dataset, the Purdue
Facebook dataset does not have well-established ground
truths, where we can use the known characteristics and
behaviors of particular nodes for evaluation. However, we can
examine aspects of a few representative nodes to illustrate
the problems that lie with usage of the aggregate or static
methods. First, we can see from Figure 3d that v_a (red) has
a consistently high ranking in the slice method, which the
MLH method captures (Figure 3). However, this person has a
declining ranking in the aggregate method, as the aggregate is
unable to capture current events: past information in the
aggregate graph results in many paths that bypass v_a, missing
this central node in later timesteps.

The next person we consider is denoted by v_b (green). In
Figure 3d, we can see that the slice method initially identifies
this person as having high centrality, then their BCR bottoms
out, and then peaks a few times again approximately
midway through the timeline. The MLH method also initially
identifies v_b as central, with a degradation over time. In
contrast, the aggregate method fails to detect the inactivity later
in the timeframe and continues to give v_b a high centrality
ranking throughout the entire time window.

The final person we consider is denoted by v_c (blue) in
Figure 3. We can see in Figure 3d that the slice method exhibits
large variability for v_c, but that there are many slices in the
middle to end of the timeframe where the node is identified
as highly central. The aggregate method is unaware of this
activity and ranks v_c at a relatively low level throughout the
timeseries. In contrast, the MLH method is able to recognize
the node's growing importance as time evolves, and does so
much more smoothly than the slice method (Figure 3d). In doing
so, the MLH method can find instances of high centrality
when both discrete methods fail.

Global Trend Analysis 

In Figure 4 we report the average path lengths for the
various measures: MLH paths, probabilistic shortest paths, the
aggregate shortest paths and the slice shortest paths.
Additionally, we report the average sampled clustering
coefficient, the clustering coefficient approximation, and the
aggregate and slice discrete clustering coefficients. These are
computed for each of the three datasets through time, and we
investigate changes in these global statistics to understand
what, if any, changes occur with respect to the small world
network structure of the data (Watts and Strogatz 1998).
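The difference between a sampled clustering coefficient and a closed-form approximation can be sketched for a single node as follows. The approximation shown is the ratio of expected triangles to expected connected triples, which is what a first-order Taylor expansion of the expectation of a ratio yields. Whether this matches the approximation used in this paper term by term is an assumption, as are the independent-edge model, the symmetric dict-of-dicts input format, the function names, and the convention that a sample in which the node has fewer than two neighbors contributes zero.

```python
import itertools
import random


def approx_local_clustering(prob, node):
    """Ratio of expected triangles to expected connected triples at `node`.

    prob: symmetric dict of dicts, prob[u][v] = probability of edge (u, v).
    """
    nbrs = [v for v in prob[node] if prob[node][v] > 0]
    exp_triangles = 0.0
    exp_triples = 0.0
    for j, k in itertools.combinations(nbrs, 2):
        p_open = prob[node][j] * prob[node][k]  # both spokes present
        exp_triples += p_open
        exp_triangles += p_open * prob[j].get(k, 0.0)  # closing edge present
    return exp_triangles / exp_triples if exp_triples > 0 else 0.0


def sampled_local_clustering(prob, node, n_samples=200, seed=0):
    """Average discrete local clustering coefficient over sampled graphs."""
    rng = random.Random(seed)
    edges = sorted({tuple(sorted((u, v))) for u in prob for v in prob[u]})
    total = 0.0
    for _ in range(n_samples):
        # Sample each undirected edge once, independently of the others.
        present = {e for e in edges if rng.random() < prob[e[0]][e[1]]}
        nbrs = [v for v in prob[node] if tuple(sorted((node, v))) in present]
        pairs = list(itertools.combinations(nbrs, 2))
        if pairs:
            closed = sum(1 for j, k in pairs if tuple(sorted((j, k))) in present)
            total += closed / len(pairs)
        # Samples with fewer than two neighbors add zero (assumed convention).
    return total / n_samples
```

The approximation needs one pass over neighbor pairs and no random draws, whereas the sampled estimate carries variance that shrinks only as n_samples grows.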



In Figures 4a,c,e, we show the clustering coefficients for
each of the three datasets. The aggregate graph significantly
overestimates the amount of current clustering in the graph,
while the slice method is highly variable, especially for
Enron. In general, both probabilistic measures are in between
the two extremes, balancing the effects of recent data and
decreasing the long term effect of past information, with the
MLH performing similarly to the sampled clustering
coefficient, and even better on DBLP, where sampling undercuts
the clustering (likely due to small sample size).

Figure 4: Average path lengths and clustering coefficients
for Enron (a,b), DBLP (c,d) and Facebook (e,f).



Next, in Figures 4b,d,f, we examine the shrinking
diameter of these small world networks
(Leskovec, Kleinberg, and Faloutsos 2005). Here, the
aggregate underestimates the path length at a current point
in time. We can see that the most probable paths closely
follow the sampling results, with both lying between the
slice and aggregate measures while avoiding the variability
of the slice method.

Conclusions 

In this paper we investigated the problem of calculating
centrality and clustering in an uncertain network, and analyzed
our methods using time evolving networks. We
demonstrated the limitation of using an aggregate graph
representation to capture uncertainty in the network structure due to
changes over time, as well as the limitation of using a
slice-based representation due to its extreme variability. We
introduced sampling-based measures for average shortest path
and betweenness centrality, as well as measures based on
the most probable paths, which are more intuitive for
capturing network flow. We also outlined exact methods for the
computation of most probable paths (and by extension, most
probable betweenness centrality), and incorporated the
notion of transmission probability. Additionally, we developed
a probabilistic clustering coefficient and gave a first order
Taylor expansion approximation for computation.

We provided empirical evidence on the Enron, DBLP, and
Facebook datasets, showing that the sampling and MLH methods
produce intuitive centrality rankings for the Enron employees
and Facebook members, and reporting the global properties for
all three. The probabilistic centrality and clustering
formulations are inherently smoother than the measures computed
from discretized time slices; however, unlike the aggregate
method, which includes all past information, they can reason
about likely change in graph structure due to changes over
time. We see that the MLH formulation is smoother than the
sampling method, indicating that the most probable paths
through the graph may be more important to consider than
shortest paths. Finally, we note that our experiments used
a relatively simple estimate of relationship strength for the
edge probabilities in the network. In future work we will
investigate alternative formulations of edge uncertainty.